ABSTRACT

To advance the state of the art in human-machine interaction, research in speech recognition has increasingly examined speech-to-text conversion, but implementations have covered only a small number of languages. Although Bengali has received comparatively little attention, we present an automatic speech recognition (ASR) system built specifically for this language, which ranks among the world's most widely spoken languages with well over 200 million speakers. Implementing Bengali ASR is demanding because the script makes extensive use of diacritic characters. We apply a series of preprocessing and feature-extraction steps and train a convolutional neural network (CNN) model for the recognition task. We then compare this model with a recurrent neural network (RNN) based on long short-term memory (LSTM) units, trained on a large speech dataset from Google. Comparison of the two models indicates that the recurrent network outperforms the convolutional network, as the former benefits from combining connectionist temporal classification (CTC) with a language model (LM). A quantitative analysis of the results shows that the word error rate and validation loss vary with the dropout rate, and that both metrics are also affected by whether clean or augmented data is used.

Keywords: Convolutional Neural Network, CTC, Word Error Rate, Edit Distance, Augmented Data, Test Loss, Validation Loss, Clean Data, Graphical User Interface
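Since the abstract reports word error rate and lists edit distance among its keywords, the following is a minimal sketch (not taken from the paper) of how WER is conventionally computed: the word-level Levenshtein edit distance between a reference transcript and an ASR hypothesis, normalized by the number of reference words.

```python
# Minimal illustrative sketch: word error rate (WER) via word-level
# edit distance. Function names here are our own, not the paper's.

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # cost of deleting all reference tokens
    for j in range(n + 1):
        dp[0][j] = j  # cost of inserting all hypothesis tokens
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def wer(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)
```

For example, `wer("a b c", "a x c")` yields 1/3, since one of the three reference words is substituted.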